annotations and polypeptides and ORFs are written to files. Gene annotation of a new
genome assembly is an important step. Because bacterial genes generally lack introns, ORF
prediction is simpler in bacteria than in eukaryotic genomes. There are many programs for
ORF prediction, but Prodigal [12] is among the most commonly used. We installed Prodigal
above. Prodigal can predict ORFs in any genomic sequence, so we can apply it to each of the
assemblies separated by binning. In the following, we predict the ORFs in one of the
assemblies recovered from the sample of the patient with severe sickle cell disease. Note
that the output files in the command carry the prefix “healthy” even though the input bin
comes from the severe sample.
prodigal -a prod_out/healthy.faa \
-d prod_out/healthy.fnt \
-o prod_out/healthy.gbk \
-s prod_out/genes.gff \
-i binning/severe/severe.1.fa \
-p single
The “-a” option specifies the FASTA file to which the polypeptides (proteins) translated
from the predicted ORFs are written. The “-d” option specifies the FASTA file for the
nucleotide sequences of the predicted ORFs. The “-o” option specifies the main output file,
in which the predicted ORFs are reported as features, by default in a GenBank-like format.
The “-s” option specifies a file to which all potential genes are written together with their
scores. Despite the “.gff” extension used above, this file is not in GFF (general feature
format); GFF output is requested by adding “-f gff”, which changes the format of the “-o”
file. The “-i” option specifies the input file, which is the assembly. The “-p” option
specifies the procedure, which is either “single” for a single genome assembly or “meta” for
a metagenomic assembly that may include genomes of multiple species.
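As a quick check of the result, the protein FASTA file written with “-a” can be summarized from the shell. The following is a minimal sketch, not part of Prodigal: the function name summarize_orfs is hypothetical, and it assumes that each header line carries Prodigal's “partial=” field, where “partial=00” marks an ORF with both a start and a stop codon.

```shell
# Hypothetical helper (not part of Prodigal): summarize a protein FASTA
# file produced with "prodigal -a".
# Assumes each header line contains a "partial=XX" field.
summarize_orfs() {
    faa="$1"
    # Every predicted ORF contributes exactly one header line.
    total=$(grep -c '^>' "$faa" || true)
    # "partial=00" means neither end of the ORF runs off a contig edge.
    complete=$(grep -c '^>.*partial=00' "$faa" || true)
    echo "total=$total complete=$complete"
}

# Run only if the output of the prodigal command above is present.
if [ -f prod_out/healthy.faa ]; then
    summarize_orfs prod_out/healthy.faa
fi
```

A large share of partial ORFs usually indicates a fragmented assembly, because many genes then run into the ends of contigs.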
8.3 SUMMARY
The metagenomic DNA is isolated from environmental samples or clinical samples in
which several microbes are present. Unlike targeted gene sequencing, shotgun metage-
nomic sequencing allows researchers to sequence the whole genomes of all organisms pres-
ent in a sample and to evaluate the microbial diversity and abundance.
Shotgun metagenomic sequencing attempts to sequence the whole genomes of a large and
diverse set of microbes, each with a different genome size. Long reads produced by
PacBio and Oxford Nanopore are preferred, although they usually have a higher error rate
than short reads. Because a metagenomic sample contains several species, the sequencing
depth must be sufficient to assemble the genomes of all species in the sample, including
the less abundant ones.
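To see why depth matters, the expected mean depth for one species can be estimated as (number of reads × read length × fraction of the DNA from that species) / genome size. The sketch below is only an illustration: the function name expected_depth and all numbers are hypothetical, and it assumes that reads are sampled in proportion to each species' share of the DNA.

```shell
# Hypothetical helper: expected_depth READS READ_LEN FRACTION GENOME_SIZE
# Prints the expected mean depth for one species, assuming reads are
# sampled in proportion to that species' share of the DNA (FRACTION).
expected_depth() {
    awk -v n="$1" -v len="$2" -v f="$3" -v g="$4" \
        'BEGIN { printf "%.1f\n", n * len * f / g }'
}

# Hypothetical sample: 20 million 150-bp reads, 5-Mbp genomes.
expected_depth 20000000 150 0.50 5000000   # species making up 50% of the DNA
expected_depth 20000000 150 0.01 5000000   # species making up 1% of the DNA
```

With these made-up numbers the dominant species is covered about 300 times, while the rare species reaches only about 6 times, which is generally too low for a reliable assembly.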
Before analysis, we should fix any quality problems by trimming adaptors, filtering out
low-quality reads, and removing technical sequences. For clinical samples, we should also
remove the host DNA by aligning the reads to the host genome and writing the unaligned
reads to new FASTQ files, which are then used in the analysis. There are two approaches to
shotgun metagenomic data analysis: assembly-free and de novo assembly. The assembly-free
approach does not require assembling the
genomes of the species in the sample; it uses reads present in the metagenomic samples
to assign taxonomic groups by identifying unique genomic regions in the reads. Most of
the programs used for taxonomy assignment require a large amount of memory and stor-
age space. The second approach uses de novo algorithms to assemble the genomes of the